library(cleaningtools)
library(dplyr)
my_raw_dataset <- cleaningtools::cleaningtools_raw_data
my_kobo_survey <- cleaningtools::cleaningtools_survey
my_kobo_choice <- cleaningtools::cleaningtools_choices01 - R framework with IMPACT - session 1
IMPACT R framework
The IMPACT R framework has been developed to meet the requirements of the IMPACT research cycle.
The ecosystem is a modular framework with two dimensions:
- a horizontal dimension that focuses on the outcome of a given step, and
- a vertical dimension that focuses on the content of a given step.
The framework is built around:
- 4 steps: cleaning, composition, analysis, outputs
- 4 verbs: check, add, create, review
- 2 adjectives: pipeable, independent
These elements will help to improve cooperation and collaboration between different teams while allowing modularity to adapt to each context and assessment.
Checking a dataset
check_outliers
my_log1 <- my_raw_dataset %>%
check_outliers(uuid_column = "X_uuid")In this example, there are:
checked_dataset: the raw dataset (with extra variables if needed)potential_outliers: a log of potential outliers
typeof(my_log1)[1] "list"
my_log1 %>%
names()[1] "checked_dataset" "potential_outliers"
The log has at least 4 columns:
uuid: the unique identifierissue: the issue being flaggedquestion: the name of the questionold_value: the value being flagged
my_log1$potential_outliers %>%
head()| uuid | issue | question | old_value |
|---|---|---|---|
| b5b1d37a-e27a-4c35-a0f5-2cde9d6dfd06 | outlier (normal distribution) | age_respondent_r | 86 |
| 956b5ed0-5a62-41b7-aec3-af93fbc5b494 | outlier (normal distribution) | age_respondent_r | 84 |
| 97ad6294-30c6-454e-a0b3-42126415b767 | outlier (log distribution) | age_respondent_r | 18 |
| e005e719-57c4-44a3-ac2f-5d6d1ff68831 | outlier (log distribution) | age_respondent_r | 18 |
| c9aaa542-118f-4e42-93de-fb0916572541 | outlier (normal distribution) | num_hh_member | 19 |
| 48e8896b-d1be-4600-8839-2d8b994ebcfb | outlier (normal distribution) | num_hh_member | 19 |
Outliers are defined as +/- 3 standard deviation from the mean.
For log outliers, log(x + 1) is used.
check_duplicate
my_log2 <- my_raw_dataset %>%
check_duplicate(uuid_column = "X_uuid")
my_log2$duplicate_log %>%
head()| uuid | old_value | question | issue |
|---|
There is no duplicate. The log is empty.
my_log3 <- my_raw_dataset %>%
check_outliers(uuid_column = "X_uuid") %>%
check_duplicate(uuid_column = "X_uuid")names(my_log3)[1] "checked_dataset" "potential_outliers" "duplicate_log"
my_log3$potential_outliers %>%
head()| uuid | issue | question | old_value |
|---|---|---|---|
| b5b1d37a-e27a-4c35-a0f5-2cde9d6dfd06 | outlier (normal distribution) | age_respondent_r | 86 |
| 956b5ed0-5a62-41b7-aec3-af93fbc5b494 | outlier (normal distribution) | age_respondent_r | 84 |
| 97ad6294-30c6-454e-a0b3-42126415b767 | outlier (log distribution) | age_respondent_r | 18 |
| e005e719-57c4-44a3-ac2f-5d6d1ff68831 | outlier (log distribution) | age_respondent_r | 18 |
| c9aaa542-118f-4e42-93de-fb0916572541 | outlier (normal distribution) | num_hh_member | 19 |
| 48e8896b-d1be-4600-8839-2d8b994ebcfb | outlier (normal distribution) | num_hh_member | 19 |
my_log3$duplicate_log %>%
head()| uuid | old_value | question | issue |
|---|
More checks
This an example of more checks that exist.
more_logs <- my_raw_dataset %>%
check_duplicate(uuid_column = "X_uuid") %>%
check_soft_duplicates(uuid_column = "X_uuid", kobo_survey = my_kobo_survey, sm_separator = ".") %>%
check_outliers(uuid_column = "X_uuid") %>%
check_value(uuid_column = "X_uuid") add_duration
more_logs$checked_dataset <- more_logs$checked_dataset %>%
add_duration(uuid_column = "X_uuid", start_column = "X.U.FEFF.start", end_column = "end")
more_logs$checked_dataset[1:6, c("start_date", "start_time", "end_date", "end_time", "days_diff", "duration")]| start_date | start_time | end_date | end_time | days_diff | duration |
|---|---|---|---|---|---|
| 2021-07-05 | 658.57 mins | 2021-07-05 | 696.68 mins | 0 days | 38.11 |
| 2021-07-05 | 608.90 mins | 2021-07-05 | 641.92 mins | 0 days | 33.02 |
| 2021-07-05 | 682.23 mins | 2021-07-05 | 726.43 mins | 0 days | 44.20 |
| 2021-07-04 | 1342.98 mins | 2021-07-04 | 1380.15 mins | 0 days | 37.17 |
| 2021-07-04 | 1391.62 mins | 2021-07-05 | 18.88 mins | 1 days | 67.26 |
| 2021-07-05 | 617.38 mins | 2021-07-05 | 756.52 mins | 0 days | 139.14 |
The duration is added to the checked_dataset in the list, not in the my_raw_dataset dataframe. The check_* functions are used in a pipe, so it needs the current dataset to be modified.
At the moment, add_duration takes very specific format. It will change in the future to become more robust and using lubridate.
check_duration can now be used with the previous checks.
more_logs <- more_logs %>%
check_duration(column_to_check = "duration", uuid_column = "X_uuid")As much as possible, check_* functions take default argument or the functions will be able to guess some information, e.g. the check_outliers function guesses some numerical values. Some functions need more information.
other/text columns
check_other needs the list of columns to be checked. It currently, it cannot detect the open text question. KOBO tool can be used.
other_columns_to_check <- my_kobo_survey %>%
dplyr::filter(type == "text") %>%
dplyr::filter(name %in% names(my_raw_dataset)) %>%
dplyr::pull(name)
more_logs <- more_logs %>%
check_others(uuid_column = "X_uuid", columns_to_check = other_columns_to_check) check_logical
In other cases, the check is specific and should be tailored to the dataset, for example, check_logical.
example_logic <- my_raw_dataset %>%
check_logical(uuid_column = "X_uuid",
check_to_perform = "primary_livelihood.employment == 1 & tot_expenses < 200000",
description = "primary_livelihood is employment but expenses less than 200 000",
columns_to_clean = c("primary_livelihood", "tot_expenses"))
example_logic$logical_xx %>%
head()| uuid | question | old_value | issue | check_id | check_binding |
|---|---|---|---|---|---|
| f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb | primary_livelihood | employment | primary_livelihood is employment but expenses less than 200 000 | logical_xx | logical_xx / f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb |
| f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb | tot_expenses | 125000 | primary_livelihood is employment but expenses less than 200 000 | logical_xx | logical_xx / f1175d29-ce1f-43a7-b3d1-ee26cd1b8cdb |
| e9f8b44c-c507-45a1-8d76-66d886437b8f | primary_livelihood | employment | primary_livelihood is employment but expenses less than 200 000 | logical_xx | logical_xx / e9f8b44c-c507-45a1-8d76-66d886437b8f |
| e9f8b44c-c507-45a1-8d76-66d886437b8f | tot_expenses | 175000 | primary_livelihood is employment but expenses less than 200 000 | logical_xx | logical_xx / e9f8b44c-c507-45a1-8d76-66d886437b8f |
| 994a60b8-e640-425c-9774-160651d7af04 | primary_livelihood | employment | primary_livelihood is employment but expenses less than 200 000 | logical_xx | logical_xx / 994a60b8-e640-425c-9774-160651d7af04 |
| 994a60b8-e640-425c-9774-160651d7af04 | tot_expenses | 175000 | primary_livelihood is employment but expenses less than 200 000 | logical_xx | logical_xx / 994a60b8-e640-425c-9774-160651d7af04 |
The log returns :
- uuid
- question: for all variables in columns_to_clean
- old value: for all variables in columns_to_clean
- issue
- check_id: logical check identifier
- check_binding: the combination of the check_id and the uuid.
One check can be flagged in several rows, in the example above, for each uuid, the primary_livelihood and tot_expenses are flagged.
Format for the test_to_perform should take the format based on tidyverse. That format is as if a new indicator is create with a mutate. That new indicator should be a logical (i.e. TRUE or FALSE) with TRUE being the value to flag.
my_raw_dataset %>%
dplyr::mutate(xxx = primary_livelihood == "employment" & tot_expenses < 1000000) %>%
dplyr::select(X_uuid, xxx, primary_livelihood, tot_expenses) %>%
head()| X_uuid | xxx | primary_livelihood | tot_expenses |
|---|---|---|---|
| dcf2753a-6ea2-40f5-b493-3527931ef96c | FALSE | loans support | 250000 |
| 8790ce5c-1c35-41a2-b3c0-538f937d5397 | TRUE | employment | 750000 |
| bb818e04-9c40-408e-919f-6b40ff1fdbb3 | FALSE | other | 250000 |
| 28b90cbb-2cf0-41c5-9ee1-1c719c0d4c02 | TRUE | employment | 600000 |
| 7f2a0c6a-529b-481f-963f-a96dca2ec034 | TRUE | employment | 500000 |
| b4f92064-12ea-4970-b0f5-fd309de1dda3 | FALSE | retirement_fund | 650000 |
The checked dataset will be return with one extra column, i.e. the logical variable with the name of the check_id.
example_logic$checked_dataset[1:6,tail(names(example_logic$checked_dataset))]| X_submission_time | X_notes | X_status | X_submitted_by | X_index | logical_xx |
|---|---|---|---|---|---|
| 2021-07-05T09:34:15 | [] | submitted_via_web | reach_irq | 1 | FALSE |
| 2021-07-05T09:34:09 | [] | submitted_via_web | reach_irq | 2 | FALSE |
| 2021-07-05T09:34:33 | [] | submitted_via_web | reach_irq | 3 | FALSE |
| 2021-07-05T09:32:58 | [] | submitted_via_web | reach_irq | 4 | FALSE |
| 2021-07-05T09:32:59 | [] | submitted_via_web | reach_irq | 5 | FALSE |
| 2021-07-05T09:37:33 | [] | submitted_via_web | reach_irq | 6 | FALSE |
check_to_perform is written as a character, between quotes and using with regular expression, e.g. escaping the quotes in “employment”.
example_logic2 <- my_raw_dataset %>%
check_logical(uuid_column = "X_uuid",
check_to_perform = "primary_livelihood == \"employment\" & tot_expenses < 1000000",
description = "primary_livelihood is rented but expenses less than 1 000 000")Warning in check_logical(., uuid_column = "X_uuid", check_to_perform
= "primary_livelihood == \"employment\" & tot_expenses < 1000000", :
columns_to_clean not shared, results may not be accurate
example_logic2$logical_xx %>%
head()| uuid | question | old_value | issue | check_id | check_binding |
|---|---|---|---|---|---|
| 8790ce5c-1c35-41a2-b3c0-538f937d5397 | primary_livelihood | employment | primary_livelihood is rented but expenses less than 1 000 000 | logical_xx | logical_xx / 8790ce5c-1c35-41a2-b3c0-538f937d5397 |
| 8790ce5c-1c35-41a2-b3c0-538f937d5397 | tot_expenses | 750000 | primary_livelihood is rented but expenses less than 1 000 000 | logical_xx | logical_xx / 8790ce5c-1c35-41a2-b3c0-538f937d5397 |
| 28b90cbb-2cf0-41c5-9ee1-1c719c0d4c02 | primary_livelihood | employment | primary_livelihood is rented but expenses less than 1 000 000 | logical_xx | logical_xx / 28b90cbb-2cf0-41c5-9ee1-1c719c0d4c02 |
| 28b90cbb-2cf0-41c5-9ee1-1c719c0d4c02 | tot_expenses | 600000 | primary_livelihood is rented but expenses less than 1 000 000 | logical_xx | logical_xx / 28b90cbb-2cf0-41c5-9ee1-1c719c0d4c02 |
| 7f2a0c6a-529b-481f-963f-a96dca2ec034 | primary_livelihood | employment | primary_livelihood is rented but expenses less than 1 000 000 | logical_xx | logical_xx / 7f2a0c6a-529b-481f-963f-a96dca2ec034 |
| 7f2a0c6a-529b-481f-963f-a96dca2ec034 | tot_expenses | 500000 | primary_livelihood is rented but expenses less than 1 000 000 | logical_xx | logical_xx / 7f2a0c6a-529b-481f-963f-a96dca2ec034 |
If you don’t include columns_to_clean the check_logical function will try to guess the variables. Not guarantee it will read or pick the correct names.
This inherits from reading a checklist in an Excel format. In Excel, there is no quote.
logical_check_list <- readxl::read_excel("inputs/01 - example - check_list.xlsx")
logical_check_list| check_id | description | check_to_perform | columns_to_clean |
|---|---|---|---|
| check_1 | primary_livelihood is employment but expenses less than 200000 | primary_livelihood.employment == 1 & tot_expenses < 200000 | primary_livelihood.employment, |
| tot_expenses | |||
| check_2 | acces water and tank emptied | access_water_enough == “totally_insufficient” & tank_emptied == “about_half” | access_water_enough, tank_emptied |
This list can then be used with check_logical_with_list.
more_logs <- more_logs %>%
check_logical_with_list(uuid_column = "X_uuid",
list_of_check = logical_check_list,
check_id_column = "check_id",
check_to_perform_column = "check_to_perform",
columns_to_clean_column = "columns_to_clean",
description_column = "description")Exercises
Try the following with a dataset:
- Perform a check to spot personal identifiable information
- Perform a check that will look at the percentages of missing value per observation and that will spot any observation that is different.
- Create logical check with the following checks. They should all be in an excel file.
- household number (variable: num_hh_member) is above 8.
- the water source for drinking water is bottled (variable: water_source_drinking, value: bottled) and the household always treat the drinking water (variable: treat_drink_water, value: always_treat).
- the water source for drinking water is bottled (variable: water_source_drinking, value: bottled) and one of the main reason for the the household to not meet its water needs is the water pressure (variable: access_water_enough_why_not, value: water_pressure, this is a select multiple)
- try add_duration on your dataset There is no specific answer just now.